Robustness of Refugee-Matching Gains to Off-Policy Evaluation Choices

arXiv.org Machine Learning

Previous research has investigated the potential of refugee matching for boosting refugee outcomes, first considered by Bansak et al. (2018). This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. In order to estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude in all scenarios as well as statistically significant in most cases. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. (2018).
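The two estimator families named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the array layout (one row per case, one column per candidate location), and the toy numbers are assumptions made for illustration only. IPW reweights observed outcomes by the ratio of target-policy to logging-policy probabilities; AIPW adds an outcome-model prediction and uses the weighted residual as a correction.

```python
import numpy as np

def ipw_value(y, a, pi_target, propensity):
    """IPW estimate of the mean outcome under a target assignment policy.

    y          : (n,) observed outcomes
    a          : (n,) observed assignments, integers in [0, K)
    pi_target  : (n, K) probabilities the target policy gives each option
    propensity : (n, K) probabilities the logging policy gave each option
    """
    idx = np.arange(len(y))
    weights = pi_target[idx, a] / propensity[idx, a]
    return float(np.mean(weights * y))

def aipw_value(y, a, pi_target, propensity, mu_hat):
    """AIPW (doubly robust) estimate; mu_hat is an (n, K) outcome model."""
    idx = np.arange(len(y))
    # Model-based value of the target policy for each case.
    direct = np.sum(pi_target * mu_hat, axis=1)
    # Importance-weighted residual on the assignment actually observed.
    weights = pi_target[idx, a] / propensity[idx, a]
    correction = weights * (y - mu_hat[idx, a])
    return float(np.mean(direct + correction))

# Hypothetical toy data: 4 cases, 2 options, uniform logging policy.
y = np.array([1.0, 0.0, 1.0, 0.0])
a = np.array([0, 1, 0, 1])
propensity = np.full((4, 2), 0.5)
pi_target = np.tile([1.0, 0.0], (4, 1))   # target policy always picks option 0
mu_hat = np.tile([1.0, 0.0], (4, 1))      # assumed outcome-model predictions

ipw_estimate = ipw_value(y, a, pi_target, propensity)
aipw_estimate = aipw_value(y, a, pi_target, propensity, mu_hat)
```

In practice the propensities and the outcome model would themselves be estimated, and the variants compared in the paper may differ in how they do so.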


Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs

arXiv.org Artificial Intelligence

In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and, instead, the evaluation often relies on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversation has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate that there is a distinct difference in ratings assigned by both annotator groups in the two setups, indicating that user feedback does influence system evaluation. Workers are more susceptible to user feedback on usefulness and interestingness, whereas LLMs are more susceptible on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.


Fine Tuning Named Entity Extraction Models for the Fantasy Domain

arXiv.org Artificial Intelligence

Named Entity Recognition (NER) is a sequence classification Natural Language Processing task where entities are identified in the text and classified into predefined categories. It acts as a foundation for most information extraction systems. Dungeons and Dragons (D&D) is an open-ended tabletop fantasy game with its own diverse lore. D&D entities are domain-specific and are thus unrecognizable by even state-of-the-art off-the-shelf NER systems, as these systems are trained on general data for predefined categories such as person (PERS), location (LOC), organization (ORG), and miscellaneous (MISC). For meaningful extraction of information from fantasy text, the entities need to be classified into domain-specific entity categories and the models need to be fine-tuned on a domain-relevant corpus. This work uses available lore of monsters in the D&D domain to fine-tune Trankit, which is a prolific NER framework that uses a pre-trained model for NER. Upon this training, the system acquires the ability to extract monster names from relevant domain documents under a novel NER tag. This work compares the accuracy of the monster name identification against the zero-shot Trankit model and two FLAIR models. The fine-tuned Trankit model achieves an 87.86% F1 score, surpassing all the other considered models.


Domain-specific transfer learning in the automated scoring of tumor-stroma ratio from histopathological images of colorectal cancer

arXiv.org Artificial Intelligence

Tumor-stroma ratio (TSR) is a prognostic factor for many types of solid tumors. In this study, we propose a method for automated estimation of TSR from histopathological images of colorectal cancer. The method is based on convolutional neural networks which were trained to classify colorectal cancer tissue in hematoxylin-eosin stained samples into three classes: stroma, tumor and other. The models were trained using a data set that consists of 1343 whole slide images. Three different training setups were applied with a transfer learning approach using domain-specific data, i.e., an external colorectal cancer histopathological data set. The three most accurate models were chosen as classifiers, TSR values were predicted and the results were compared to a visual TSR estimation made by a pathologist. The results suggest that classification accuracy does not improve when domain-specific data are used in the pre-training of the convolutional neural network models in the task at hand. Classification accuracy for stroma, tumor and other reached 96.1% on an independent test set. Among the three classes, the best model achieved the highest accuracy (99.3%) for the tumor class. When TSR was predicted with the best model, the correlation between the predicted values and values estimated by an experienced pathologist was 0.57. Further research is needed to study associations between computationally predicted TSR values and other clinicopathological factors of colorectal cancer and the overall survival of the patients.
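How a slide-level TSR value might be derived from per-tile class predictions can be sketched as follows. The abstract does not state the exact formula, so the definition used here (stroma tiles as a fraction of stroma plus tumor tiles, ignoring "other") is an assumption, as are the function name and the toy tile labels.

```python
from collections import Counter

def tumor_stroma_ratio(tile_labels):
    """Fraction of stroma among tumor-or-stroma tiles (assumed definition).

    tile_labels: iterable of predicted classes, each one of
                 "stroma", "tumor" or "other".
    """
    counts = Counter(tile_labels)
    stroma, tumor = counts["stroma"], counts["tumor"]
    if stroma + tumor == 0:
        raise ValueError("no tumor or stroma tiles in the region")
    return stroma / (stroma + tumor)

# Hypothetical predictions for ten tiles of one region.
labels = ["stroma"] * 3 + ["tumor"] * 1 + ["other"] * 6
tsr = tumor_stroma_ratio(labels)
```

Counting tiles is the simplest aggregation; a pixel-level or area-weighted version would follow the same pattern.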


Domain-Specific Text Generation for Machine Translation

arXiv.org Artificial Intelligence

Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.


Semantic Search as Extractive Paraphrase Span Detection

arXiv.org Artificial Intelligence

In this paper, we approach the problem of semantic search by framing the search task as paraphrase span detection, i.e. given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as typically used in extractive question answering. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs including their original document context, we find that our paraphrase span detection model outperforms two strong retrieval baselines (lexical similarity and BERT sentence embeddings) by 31.9pp and 22.4pp respectively in terms of exact match, and by 22.3pp and 12.9pp in terms of token-level F-score. This demonstrates a strong advantage of modelling the task in terms of span retrieval, rather than sentence similarity. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available.
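The two reported metrics can be sketched in the style commonly used for extractive question answering. This is an illustrative assumption rather than the paper's evaluation script: whitespace tokenization and bag-of-tokens overlap are simplifications, and the function names are hypothetical.

```python
from collections import Counter

def exact_match(predicted_span, gold_span):
    """True when the predicted and gold spans have identical token sequences."""
    return predicted_span.split() == gold_span.split()

def token_f_score(predicted_span, gold_span):
    """Harmonic mean of token-level precision and recall between two spans."""
    pred_tokens = predicted_span.split()
    gold_tokens = gold_span.split()
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical spans: two of three tokens are shared.
score = token_f_score("a b c", "b c d")
```

Exact match rewards only a perfectly recovered span, while the token-level score gives partial credit for overlapping spans, which is why the two margins over the baselines differ.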


Persistent Reductions in Regularized Loss Minimization for Variable Selection

arXiv.org Machine Learning

In the context of regularized loss minimization with polyhedral gauges, we show that for a broad class of loss functions (possibly non-smooth and non-convex) and under a simple geometric condition on the input data, it is possible to efficiently identify a subset of features which are guaranteed to have zero coefficients in all optimal solutions of all problems with loss functions from said class, before any iterative optimization has been performed for the original problem. This procedure is standalone, takes only the data as input, and does not require any calls to the loss function. Therefore, we term this procedure a persistent reduction for the aforementioned class of regularized loss minimization problems. This reduction can be efficiently implemented via an extreme ray identification subroutine applied to a polyhedral cone formed from the datapoints. We employ an existing output-sensitive algorithm for extreme ray identification, which makes our guarantee and algorithm applicable in ultra-high dimensional problems.


To update or not to update? Delayed Nonparametric Bandits with Randomized Allocation

arXiv.org Machine Learning

Contextual bandits provide a natural framework for modeling many practical sequential decision-making problems in various fields. Woodroofe (1979) started studying multi-armed bandit problems with side information in a parametric framework, and Yang and Zhu (2002) initiated an investigation from a nonparametric perspective. See Lai (2001) and Bartroff et al. (2008) for reviews on general sequential problems and Bubeck and Cesa-Bianchi (2012) for bandits exclusively. In recent years, bandit problems have gained popularity and have been studied extensively under different names, such as contextual bandits, multi-armed bandits with covariates (MABC), associative bandit problems and multi-armed bandits with side information. For example, when treating patients with a disease, the doctor needs to decide which of several competing treatments would be best for the current patient, given the patient's covariate information and data available from previous patients. Most bandit algorithms assume that rewards are observed instantaneously, but in most practical situations, rewards are only obtained after some delay. For example, it is often the case that several other patients have to be treated before the outcome for the current patient is observed. One way to tackle this problem is to adopt black-box procedures that incorporate delayed rewards using already existing no-delay policies in the stochastic bandits setting.
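The black-box idea in the last sentence can be illustrated with a minimal sketch: a wrapper buffers each reward until its delay has elapsed and only then passes it to an unmodified no-delay policy. This is not the paper's nonparametric randomized-allocation method; the epsilon-greedy base policy, the class and method names, and the round-based notion of delay are all assumptions chosen for brevity.

```python
import random

class EpsilonGreedy:
    """A standard no-delay policy for a non-contextual stochastic bandit."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # Greedy choice; ties resolve to the lowest arm index.
        return max(range(len(self.means)), key=self.means.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class DelayedWrapper:
    """Feeds rewards to the base policy only once their delay has elapsed."""

    def __init__(self, base):
        self.base = base
        self.pending = []   # entries are (arrival_round, arm, reward)
        self.round = 0

    def select(self):
        # Release every reward whose arrival round has been reached.
        arrived = [p for p in self.pending if p[0] <= self.round]
        self.pending = [p for p in self.pending if p[0] > self.round]
        for _, arm, reward in arrived:
            self.base.update(arm, reward)
        self.round += 1
        return self.base.select()

    def observe_later(self, arm, reward, delay):
        # delay = number of further rounds played before the reward is seen.
        self.pending.append((self.round + delay, arm, reward))

# One reward with a delay of two rounds (epsilon=0 keeps this deterministic).
base = EpsilonGreedy(n_arms=2, epsilon=0.0)
policy = DelayedWrapper(base)
first_arm = policy.select()
policy.observe_later(first_arm, reward=1.0, delay=2)
policy.select()
policy.select()
counts_before_arrival = list(base.counts)
policy.select()
counts_after_arrival = list(base.counts)
```

The base policy never needs to know about delays, which is what makes the approach black-box; the cost is that decisions in the intervening rounds are made on stale estimates.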


D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multiple High-dimensional Datasets

arXiv.org Machine Learning

Such studies include The Cancer Genome Atlas (TCGA; Hoadley et al., 2018) with multi-platform genomic data for tumor samples, and the Human Connectome Project (HCP; Van Essen et al., 2013) with multi-modal brain images of healthy adults, among many others (Crawford et al., 2016; Jensen et al., 2017). The use of multiple data types can enhance our understanding of the etiology of many complex diseases, such as cancers (Ciriello et al., 2015; Campbell et al., 2018) and neurodegenerative diseases (Weiner et al., 2013; Saeed et al., 2017). Researchers have hence become highly interested in studying the shared information and individual features across multi-type datasets by separating their common and distinctive variation structures (van der Kloet et al., 2016; Smilde et al., 2017; Li et al., 2018). Let $Y_k \in \mathbb{R}^{p_k \times n}$ be the $k$-th row-mean centered dataset obtained on a common set of $n$ objects for $k = 1, \dots, K$, where $p_k$ is the number of variables for the $k$-th dataset. One popular approach for disentangling their common and distinctive variation structures is to decompose each data matrix into $Y_k = X_k + E_k = C_k + D_k + E_k$ for $k = 1, \dots, K$, (1) where $\{X_k\}_{k=1}^{K}$ are low-rank signal matrices with $\{E_k\}_{k=1}^{K}$ being additive noise matrices, $\{C_k\}_{k=1}^{K}$ are low-rank common-variation matrices that represent the signal data coming from the common mechanism shared across all datasets, and $\{D_k\}_{k=1}^{K}$ are low-rank distinctive-variation matrices, each from the distinctive mechanism of a single dataset that is not shared by all.


Sensitivity Analysis of Deep Neural Networks

arXiv.org Machine Learning

Deep neural networks (DNNs) have achieved superior performance in various prediction tasks, but can be very vulnerable to adversarial examples or perturbations. Therefore, it is crucial to measure the sensitivity of DNNs to various forms of perturbations in real applications. We introduce a novel perturbation manifold and its associated influence measure to quantify the effects of various perturbations on DNN classifiers. Such perturbations include various external and internal perturbations to input samples and network parameters. The proposed measure is motivated by information geometry and provides desirable invariance properties. We demonstrate that our influence measure is useful for four model building tasks: detecting potential 'outliers', analyzing the sensitivity of model architectures, comparing network sensitivity between training and test sets, and locating vulnerable areas. Experiments show reasonably good performance of the proposed measure for the popular DNN models ResNet50 and DenseNet121 on CIFAR10 and MNIST datasets.